Extracting structured information from 2D images
Convolutional neural networks can handle an impressive array of supervised learning tasks while relying on a single backbone architecture, suggesting that one solution fits all vision problems. But for many tasks, we can directly make use of the problem structure within neural networks to deliver more accurate predictions. In this thesis, we propose novel deep learning components that exploit the structured output space of an increasingly complex set of problems. We start from Optical Character Recognition (OCR) in natural scenes and leverage the constraints imposed by the spatial outline of letters and language requirements. Conventional OCR systems do not work well in natural scenes due to distortions, blur, or letter variability. We introduce a new attention-based model, equipped with extra information about neuron positions, to guide its focus across characters sequentially. It beats the previous state of the art by a significant margin. We then turn to dense labeling tasks employing encoder-decoder architectures. We start with an experimental study that documents the drastic impact that decoder design can have on task performance. Rather than optimizing one decoder per task separately, we propose new robust layers for the upsampling of high-dimensional encodings. We show that these better suit the structured per-pixel output across all tasks. Finally, we turn to the problem of urban scene understanding. There is elaborate structure in both the input space (multi-view recordings, aerial and street-view scenes) and the output space (multiple fine-grained attributes for holistic building understanding). We design new models that benefit from the relatively simple cuboid-like geometry of buildings to create a single unified representation from multiple views.
To benchmark our model, we build a new large-scale multi-view dataset of building images and fine-grained attributes and show systematic improvements over a broad range of strong CNN-based baselines.
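The "extra information about neuron positions" mentioned above can be illustrated by appending normalized coordinates to every spatial feature vector before attention scores are computed. The sketch below is a hypothetical illustration of the general idea in plain Python, not the thesis's exact mechanism:

```python
def add_coordinate_channels(features):
    """Append normalized (row, col) coordinates to each spatial feature
    vector so a downstream attention module can reason about where
    characters lie in the image. Illustrative sketch only.
    features: nested list [height][width][dim]."""
    h, w = len(features), len(features[0])
    return [[feat + [i / max(h - 1, 1), j / max(w - 1, 1)]
             for j, feat in enumerate(row)]
            for i, row in enumerate(features)]

# A 2x2 feature map with 1-dimensional features gains two coordinate
# channels, each normalized to [0, 1].
fmap = add_coordinate_channels([[[1.0], [2.0]], [[3.0], [4.0]]])
```

In a real model the same trick is typically applied as extra input channels to a convolutional feature map, so the attention weights can depend on position as well as appearance.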
Rethinking the Inception Architecture for Computer Vision
Convolutional networks are at the core of most state-of-the-art computer
vision solutions for a wide variety of tasks. Since 2014 very deep
convolutional networks started to become mainstream, yielding substantial gains
in various benchmarks. Although increased model size and computational cost
tend to translate to immediate quality gains for most tasks (as long as enough
labeled data is provided for training), computational efficiency and low
parameter count are still enabling factors for various use cases such as mobile
vision and big-data scenarios. Here we explore ways to scale up networks in
ways that aim at utilizing the added computation as efficiently as possible by
suitably factorized convolutions and aggressive regularization. We benchmark
our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set.
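The factorized convolutions referred to above can be illustrated with a quick parameter count: a large spatial convolution is replaced by a stack of smaller ones covering the same receptive field. The numbers below use an assumed channel width for illustration, not a figure from the paper:

```python
def conv_params(in_ch, out_ch, kh, kw, bias=True):
    """Parameter count of a single 2D convolution layer."""
    return out_ch * (in_ch * kh * kw + (1 if bias else 0))

in_ch = out_ch = 192  # illustrative channel width, assumed for this example

# A 5x5 convolution vs. two stacked 3x3 convolutions (same 5x5 receptive field).
five = conv_params(in_ch, out_ch, 5, 5)
stacked = conv_params(in_ch, out_ch, 3, 3) + conv_params(out_ch, out_ch, 3, 3)
# stacked / five is roughly 0.72: about 28% fewer parameters.

# Asymmetric factorization: an n x n convolution split into 1 x n then n x 1.
asym = conv_params(in_ch, out_ch, 1, 7) + conv_params(out_ch, out_ch, 7, 1)
```

The stacked form also interleaves extra nonlinearities between the smaller convolutions, which is part of why the factorized networks hold up in accuracy.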
Deep Learning for Rheumatoid Arthritis: Joint Detection and Damage Scoring in X-rays
Recent advancements in computer vision promise to automate medical image
analysis. Rheumatoid arthritis is an autoimmune disease that would profit from
computer-based diagnosis, as there are no direct markers known, and doctors
have to rely on manual inspection of X-ray images. In this work, we present a
multi-task deep learning model that simultaneously learns to localize joints on
X-ray images and diagnose two kinds of joint damage: narrowing and erosion.
Additionally, we propose a modification of label smoothing, which combines
classification and regression cues into a single loss and achieves 5% relative
error reduction compared to standard loss functions. Our final model obtained
4th place in joint space narrowing and 5th place in joint erosion in the global
RA2 DREAM challenge.
Comment: Presented at the Workshop on AI for Public Health at ICLR 2021.
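The abstract does not spell out the modified label smoothing, but one way to combine classification and regression cues for ordinal damage scores is to spread label mass over neighbouring scores with a distance-based kernel, so that cross-entropy also penalizes predictions by their distance from the true score. The sketch below is an assumption-laden illustration, not the paper's exact loss:

```python
import math

def ordinal_label_smoothing(true_class, num_classes, sigma=1.0):
    """Hypothetical ordinal smoothing: distribute the target label over
    nearby damage-score classes with a Gaussian kernel. Compared to
    uniform label smoothing, nearby scores get more mass, which injects
    a regression-like cue into a classification loss. Illustrative only."""
    weights = [math.exp(-((c - true_class) ** 2) / (2 * sigma ** 2))
               for c in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]

# For a true score of 2 out of 5 classes, mass peaks at class 2 and
# decays symmetrically with distance.
target = ordinal_label_smoothing(2, 5)
```

Training then uses the smoothed vector as the cross-entropy target instead of a one-hot label.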
The Devil is in the Decoder: Classification, Regression and GANs
Many machine vision applications, such as semantic segmentation and depth
prediction, require predictions for every pixel of the input image. Models for
such problems usually consist of encoders which decrease spatial resolution
while learning a high-dimensional representation, followed by decoders that recover the original input resolution and produce low-dimensional predictions. While encoders have been studied rigorously, relatively few
studies address the decoder side. This paper presents an extensive comparison
of a variety of decoders for a variety of pixel-wise tasks ranging from
classification, regression to synthesis. Our contributions are: (1) Decoders
matter: we observe significant variance in results between different types of
decoders on various problems. (2) We introduce new residual-like connections
for decoders. (3) We introduce a novel decoder: bilinear additive upsampling.
(4) We explore prediction artifacts.
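The bilinear additive upsampling idea can be sketched as spatial upsampling followed by summing groups of consecutive channels, so that doubling the resolution while summing groups of four keeps the total number of values constant. The sketch below is a simplification: it uses nearest-neighbour repetition in place of true bilinear interpolation, and plain Python lists in place of tensors:

```python
def bilinear_additive_upsample(x, factor=2, group=4):
    """Simplified sketch of additive upsampling.
    x: feature map as nested lists [channels][height][width].
    Spatially upsamples each channel (nearest-neighbour here for
    brevity; the actual layer interpolates bilinearly), then sums
    every `group` consecutive channels to reduce the channel count."""
    c = len(x)
    assert c % group == 0, "channel count must be divisible by group"
    # Spatial upsampling by repeating rows and columns.
    up = [[[v for v in row for _ in range(factor)]
           for row in ch for _ in range(factor)]
          for ch in x]
    # Additive channel reduction: sum groups of `group` channels.
    out = []
    for g in range(0, c, group):
        chans = up[g:g + group]
        h, w = len(chans[0]), len(chans[0][0])
        out.append([[sum(ch[i][j] for ch in chans) for j in range(w)]
                    for i in range(h)])
    return out
```

Because the channel reduction is a fixed sum rather than a learned transposed convolution, this kind of layer avoids the checkerboard artifacts the paper investigates.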
Holistic Multi-View Building Analysis in the Wild with Projection Pooling
We address six different classification tasks related to fine-grained
building attributes: construction type, number of floors, pitch and geometry of
the roof, facade material, and occupancy class. Tackling such a remote building
analysis problem became possible only recently due to growing large-scale
datasets of urban scenes. To this end, we introduce a new benchmarking dataset,
consisting of 49426 images (top-view and street-view) of 9674 buildings. These photos are assembled together with geometric metadata. The dataset
showcases various real-world challenges, such as occlusions, blur, partially
visible objects, and a broad spectrum of buildings. We propose a new projection
pooling layer, fusing the top view and the side views into a unified top-view representation in a high-dimensional space. It allows us to utilize the
building and imagery metadata seamlessly. Introducing this layer improves
classification accuracy -- compared to highly tuned baseline models --
indicating its suitability for building analysis.
Comment: Accepted for publication at the 35th AAAI Conference on Artificial Intelligence (AAAI 2021).
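As a rough illustration of the projection-pooling idea, the sketch below assumes each view's feature vectors arrive annotated with the top-view grid cell they project to (in the actual model this mapping would come from the building geometry and imagery metadata) and averages the features that land in the same cell into one unified top-view representation. All names are hypothetical:

```python
def projection_pool(view_features, cell_index, grid_cells):
    """Hypothetical sketch of projection pooling.
    view_features: list of feature vectors (lists of floats) gathered
                   from all views of one building.
    cell_index:    list of ints, same length; the top-view grid cell
                   each feature vector projects to.
    grid_cells:    number of cells in the top-view grid.
    Returns one averaged feature vector per grid cell (zeros for
    cells no feature projects to)."""
    dim = len(view_features[0])
    sums = [[0.0] * dim for _ in range(grid_cells)]
    counts = [0] * grid_cells
    for feat, cell in zip(view_features, cell_index):
        counts[cell] += 1
        for d in range(dim):
            sums[cell][d] += feat[d]
    return [[s / counts[c] if counts[c] else 0.0 for s in sums[c]]
            for c in range(grid_cells)]
```

Averaging makes the pooled representation invariant to how many views happen to cover each cell, which matters when buildings are photographed from varying numbers of street-view positions.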